The data was collected and made available by the "National Institute of Diabetes and Digestive and Kidney Diseases" as part of the Pima Indians Diabetes Database.
Diabetes.csv is available from Kaggle. We have two main questions: which attributes are most correlated with a positive diagnosis, and, if we could only ask a patient two questions, what should we ask and how would we turn the answers into a risk of being diagnosed?
This is a machine learning dataset, and normally we'd just extract features, feed them to an ML algorithm, and sit back and relax. But we'll get our hands dirty so that you can learn more along the way.
We'll be using Python and some of its popular data science related packages. First of all, we will import pandas to read our data from a CSV file and manipulate it for further use.
We will also use numpy to convert our data into a format suitable for feeding our classification model.
We'll use seaborn and matplotlib for visualizations. We will then import the Logistic Regression algorithm from sklearn, which will help us build a classification model.
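A minimal import cell for this stack might look like the following (the CSV filename and path are an assumption; point the commented-out line at your local copy of the Kaggle file):

```python
# Core stack for this walkthrough (all assumed installed via pip/conda).
import pandas as pd                      # read the CSV and wrangle columns
import numpy as np                       # array conversions for the model
import matplotlib.pyplot as plt          # base plotting
import seaborn as sns                    # statistical visualizations
from sklearn.linear_model import LogisticRegression  # our classifier

# Loading the dataset would then look like this (path is an assumption):
# df = pd.read_csv("diabetes.csv")
```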
The following features have been provided to help us predict whether a person is diabetic or not:
We can use the describe function, which gives us summary statistics for the numeric columns.
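As a sketch, here is describe in action. Since the CSV isn't bundled here, a small synthetic DataFrame (column names from the real dataset, values made up) stands in for it:

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for diabetes.csv -- the column names match the
# real dataset, but these values are randomly generated for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Pregnancies": rng.integers(0, 10, 50),
    "Glucose": rng.normal(120, 30, 50),
    "BMI": rng.normal(32, 6, 50),
    "Age": rng.integers(21, 70, 50),
    "Outcome": rng.integers(0, 2, 50),
})

# count, mean, std, min, quartiles and max for every numeric column
summary = df.describe()
print(summary.loc["mean"])
```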
The scatter plot gives us both the histograms of the distributions along the diagonal, and a grid of 2D scatter plots off the diagonal. Note that this is a symmetric matrix, so we normally just look at the diagonal plus either the upper or lower triangle. We can see that some variables have a lot of scatter and some are correlated (i.e. there is a direction to their scatter). Which leads us to...
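A scatter-matrix plot like the one described can be produced with pandas' built-in helper; a sketch on synthetic stand-in data (the real call would take your loaded diabetes DataFrame instead):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Glucose": rng.normal(120, 30, 200),
    "BMI": rng.normal(32, 6, 200),
    "Age": rng.integers(21, 70, 200).astype(float),
})

# Histograms on the diagonal, pairwise 2D scatter plots off-diagonal.
axes = pd.plotting.scatter_matrix(df, figsize=(6, 6), diagonal="hist")
plt.savefig("scatter_matrix.png")
```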
To easily quantify which variables / attributes are correlated with others!
https://www.statisticshowto.com/probability-and-statistics/correlation-analysis/
And you can see this is a symmetric matrix too. But it immediately allows us to point out the most correlated and anti-correlated attributes. Some might just be common sense - Pregnancies v Age for example - but some might give us real insight into the data.
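The correlation matrix itself is one call on the DataFrame. A sketch on synthetic data, where Pregnancies is deliberately built to track Age (mimicking the common-sense correlation noted above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
age = rng.integers(21, 70, 300).astype(float)
df = pd.DataFrame({
    "Age": age,
    # Built to rise with Age, so these two columns come out correlated.
    "Pregnancies": np.clip(age / 8 + rng.normal(0, 1.5, 300), 0, None),
    "Glucose": rng.normal(120, 30, 300),
})

# Pearson correlation: symmetric, with 1s on the diagonal.
corr = df.corr()
print(corr.round(2))
```

To visualize it the way the text describes, `sns.heatmap(corr, annot=True)` is the usual one-liner.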
Covariance is a measure of how much two random variables vary together.
In other words, it is the measure of the joint spread between two random variables.
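To make the definition concrete, here is a small sketch with numpy: two variables built to vary together, and the covariance matrix that quantifies it (the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(3)
# Two variables that vary together: y is built from x plus noise.
x = rng.normal(0, 1, 1000)
y = 2 * x + rng.normal(0, 0.5, 1000)

# 2x2 covariance matrix: variances on the diagonal,
# the joint spread of (x, y) off the diagonal.
cov = np.cov(x, y)
print(cov.round(2))
```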
Useful when you have a lot of data - i.e. at least thousands of points.
A bit hard to get information from the 2D histogram, isn't it? Too much noise in the image. What if we try a contour diagram? We'll have to bin the data ourselves. The contour API is here
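The bin-it-yourself approach can be sketched like this: histogram the two variables onto a grid with `np.histogram2d`, then hand the bin counts to `contour`. The Glucose/BMI data here is synthetic, standing in for the real columns:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(4)
glucose = rng.normal(120, 30, 2000)  # synthetic stand-ins
bmi = rng.normal(32, 6, 2000)

# Bin the data ourselves onto a 25x25 grid...
counts, xedges, yedges = np.histogram2d(glucose, bmi, bins=25)
xcenters = 0.5 * (xedges[:-1] + xedges[1:])
ycenters = 0.5 * (yedges[:-1] + yedges[1:])

# ...then draw contours of the bin counts.
fig, ax = plt.subplots()
ax.contour(xcenters, ycenters, counts.T)  # .T: histogram2d returns (x, y)
ax.set_xlabel("Glucose")
ax.set_ylabel("BMI")
fig.savefig("contour.png")
```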
With these plots we smooth the data ourselves. Seaborn to the rescue!
A scatter plot is normally fairly informative and very fast to plot.
Using the library ChainConsumer (examples here), written by Samuel Hinton.
It is easy to install:
pip install chainconsumer
Based on the previous Correlation plot, a simple approach might be just to use the top correlated variables and investigate them further. In our case, they're: Glucose, BMI and Age.
So it's not perfect, but we can probably do an alright job approximating both of these distributions as Gaussians.
A multivariate normal allows us to model Glucose, BMI and Age as a single normal in 3 dimensions rather than 3 independent normals. That way we can capture the correlations between them.
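Fitting one is just a mean vector plus a covariance matrix. A sketch with scipy on synthetic stand-in data (Glucose is deliberately built to track Age, so the fitted covariance picks up that correlation):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
n = 500
# Synthetic Glucose/BMI/Age with a built-in Glucose-Age correlation.
age = rng.normal(40, 12, n)
glucose = 100 + 0.5 * age + rng.normal(0, 25, n)
bmi = rng.normal(32, 6, n)
data = np.column_stack([glucose, bmi, age])

mean = data.mean(axis=0)
cov = np.cov(data, rowvar=False)  # 3x3; off-diagonals hold the correlations

# One 3-dimensional normal instead of three independent 1D normals.
mvn = multivariate_normal(mean=mean, cov=cov)
density_at_mean = mvn.pdf(mean)  # sanity check: density peaks at the mean
```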
More here: https://machinelearning-blog.com/2018/04/23/logistic-regression-101/
To get a better sense of what is going on inside the logistic regression model, we can visualize how our model uses the different features and which features have the greatest effect on the outcome.
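One common way to do this is to fit on standardized features and compare the coefficient magnitudes. A sketch on synthetic data, where the outcome is deliberately driven mostly by Glucose so that its coefficient dominates:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 400
# Synthetic features; Outcome depends mostly on Glucose in this toy setup.
glucose = rng.normal(120, 30, n)
bmi = rng.normal(32, 6, n)
age = rng.normal(40, 12, n)
X = np.column_stack([glucose, bmi, age])
y = (glucose + rng.normal(0, 20, n) > 130).astype(int)

# Standardize so coefficient magnitudes are comparable across features.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
model = LogisticRegression().fit(Xs, y)

# Larger |coefficient| => greater effect on the predicted outcome.
for name, coef in zip(["Glucose", "BMI", "Age"], model.coef_[0]):
    print(f"{name}: {coef:+.2f}")
```

A horizontal bar chart of `model.coef_[0]` against the feature names then gives the visualization described above.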